Content based web spam detection using naive bayes with different feature representation technique
Abstract
Web spam detection is the process of organizing search results according to specified criteria. Most often the term refers to the automatic processing of search results, but it also applies to the automatic classification of search results into ham and spam. Our work evaluates how performance changes when the document vector uses different representations: term frequency (TF), binary, inverse document frequency (IDF), and TF-IDF. Various benchmark datasets are available to researchers working on web spam filtering, and significant effort has gone into generating public benchmark datasets for anti-web-spam filtering. One of the main concerns is how to protect the privacy of the users whose ham links are included in the datasets. We perform a statistical analysis of a large collection of web pages, focusing on spam detection. Dimension reduction is an important part of classification because it makes high-dimensional data easier to visualize. This work uses the training data both reduced to 2D and at full dimensionality, and maps the training and test data into a vector space. Of the several classifiers available, we use Naive Bayes: we train on the dataset under each representation and test with different spam-to-ham ratios.
Key-Words: Content spam, keyword count, variety, density, and hidden or invisible text
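The four document-vector representations named in the abstract, and their use with a Naive Bayes classifier, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy corpus, the smoothed IDF formula, the Laplace smoothing constant `alpha`, and the function names `vectorize`, `train_nb`, and `predict` are all assumptions made for illustration. Feeding fractional IDF or TF-IDF weights to a multinomial Naive Bayes model is a common heuristic and may differ from what the paper does.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (tokens, label) pairs; not data from the paper.
TRAIN = [
    ("cheap pills cheap offer buy now".split(), "spam"),
    ("buy cheap watches offer".split(), "spam"),
    ("conference schedule and paper deadline".split(), "ham"),
    ("paper review schedule".split(), "ham"),
]

VOCAB = sorted({t for tokens, _ in TRAIN for t in tokens})
N_DOCS = len(TRAIN)
# Document frequency: number of training documents containing each term.
DF = Counter(t for tokens, _ in TRAIN for t in set(tokens))


def idf(term):
    # Assumed smoothed variant, log((1 + N) / (1 + df)) + 1, always positive.
    return math.log((1 + N_DOCS) / (1 + DF[term])) + 1.0


def vectorize(tokens, scheme):
    """Map a token list to a vector over VOCAB under one weighting scheme."""
    counts = Counter(tokens)
    if scheme == "tf":
        return [float(counts[t]) for t in VOCAB]
    if scheme == "binary":
        return [1.0 if counts[t] else 0.0 for t in VOCAB]
    if scheme == "idf":
        return [idf(t) if counts[t] else 0.0 for t in VOCAB]
    if scheme == "tfidf":
        return [counts[t] * idf(t) for t in VOCAB]
    raise ValueError(f"unknown scheme: {scheme}")


def train_nb(scheme, alpha=1.0):
    """Fit multinomial Naive Bayes on weighted vectors (Laplace smoothing)."""
    model = {}
    for label in sorted({lab for _, lab in TRAIN}):
        docs = [vectorize(tok, scheme) for tok, lab in TRAIN if lab == label]
        totals = [sum(col) for col in zip(*docs)]  # per-term weight in class
        denom = sum(totals) + alpha * len(VOCAB)
        log_prior = math.log(len(docs) / N_DOCS)
        log_weights = [math.log((w + alpha) / denom) for w in totals]
        model[label] = (log_prior, log_weights)
    return model


def predict(model, tokens, scheme):
    """Return the label with the highest log-posterior score."""
    vec = vectorize(tokens, scheme)
    scores = {
        label: log_prior + sum(x * lw for x, lw in zip(vec, log_weights))
        for label, (log_prior, log_weights) in model.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for scheme in ("tf", "binary", "idf", "tfidf"):
        model = train_nb(scheme)
        print(scheme, predict(model, "cheap offer buy".split(), scheme))
```

Only the `scheme` argument changes between runs, so the same training and test split can be compared across representations, which mirrors the comparison the abstract describes.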
Similar resources
Content-based Dynamic Email Spam Detecting Using Fuzzy Granular Computing Approach
Spam detection is a significant problem which is considered by many researchers by various developed strategies. The best and main spam detection technique should consider and scan the content of the messages to find spam. This research concerns the development of the certain category of granular computing as a classifier for spam detection. In this research, Fuzzy Granular Computing Classifica...
Spam Detection System Combining Cellular Automata and Naïve Bayes Classifier
In this study, we focus on the problem of spam detection. Based on a cellular automaton approach and naïve Bayes technique which are built as individual classifiers we evaluate a novel method combining multiple classifiers diversified both by feature selection and different classifiers to determine whether we can more accurately detect Spam. This approach combines decisions from three cellular ...
Evolutionary Symbiotic Feature Selection for Email Spam Detection
This work presents a symbiotic filtering approach enabling the exchange of relevant word features among different users in order to improve local anti-spam filters. The local spam filtering is based on a Content-Based Filtering strategy, where word frequencies are fed into a Naive Bayes learner. Several Evolutionary Algorithms are explored for feature selection, including the proposed symbiotic ...
Email Representation using Noncharacteristic Information and its Application
Focusing on the uncertainty of classifying emails based on email content and the incompleteness of email representation, the paper proposes a new representation using noncharacteristic information. The new approach refers to the whole email and contains feature items extracted from email content and noncharacteristic items extracted from the email header. In the experiment, we adopt Naïve Bayes classi...
Web Spam Detection Using Machine Learning in Specific Domain Features
In the last few years, as Internet usage becomes the main artery of the life's daily activities, the problem of spam becomes very serious for internet community. Spam pages form a real threat for all types of users. This threat proved to evolve continuously without any clue to abate. Different forms of spam witnessed a dramatic increase in both size and negative impact. A large amount of E-mail...